Linguistic analysis of differences in portrayal of movie characters

ABSTRACT

A computer-implemented method for analyzing media content includes a step of providing a plurality of narrative files formatted in human readable format. Each narrative file includes a script and/or dialogues tagged with character names along with auxiliary information. Each script includes a plurality of portrayals performed by an associated actor or character. Linguistic representations of content of the narrative files in both abstract and semantic forms are determined. The linguistic representations are connected to higher order representations and mental states. The linguistic representations are connected to behavior and action. Interplay between language constructs and demographics of content creators is analyzed. Content representations towards individuals/groups are adapted to reflect heterogeneity in preferences.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. provisional application Ser. No. 62/560,954 filed Sep. 20, 2017, the disclosure of which is incorporated in its entirety by reference herein.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

The invention was made with Government support under Contract No. 1029373 awarded by the National Science Foundation. The Government has certain rights to the invention.

TECHNICAL FIELD

In at least one aspect, the present invention relates to a method for analyzing movie content.

BACKGROUND

Movies are often described as having the power to influence individual beliefs and values. In (Cape, 2003), the authors assert movies' influence in both creating new thinking patterns in previously unexplored social phenomena, especially in children, as well as their ability to update an individual's existing social boundaries based on what is shown on screen as the “norm”. Some authors claim the inverse: that movies reflect existing cultural values of the society, adding weight to their ability in influencing individual beliefs of what is accepted as the norm. As a result, they are studied in multiple disciplines to analyze their influence.

Movies are particularly scrutinized in aspects involving negative stereotyping (Cape, 2003; Dimnik and Felton, 2006; Ter Bogt et al., 2010; Hedley, 1994) since this may introduce questionable beliefs in viewers. Negative stereotyping is believed to impact society in multiple aspects such as self-induced undermining of ability (Davies et al., 2005) as well as causing forms of prejudice that can impact leadership or employment prospects (Eagly and Karau, 2002; Niven, 2006). Studies in analyzing stereotyping in movies typically rely on collecting manual annotations on a small set of movies on which hypothesis tests are conducted (Behm-Morawitz and Mastro, 2008; Benshoff and Griffin, 2011; Hooks, 2009).

Language use has long been known as a strong indicator of the speaker's psychological and emotional state (Gottschalk and Gleser, 1969) and is well studied in a number of applications such as automatic personality detection (Mairesse et al., 2007) and psychotherapy (Xiao et al., 2015; Pennebaker et al., 2003). Computational analysis of language has been particularly popular thanks to advancements in computing and the ease of conducting large scale analysis of text on computers (Pennebaker et al., 2015).

Previous works in studying representation in movies largely focus on relative frequencies, particularly on character gender. In (Smith et al., 2014), the authors studied 120 movies from around the globe which were manually annotated to capture information about character gender, age, careers, writer gender and director gender. However, since the annotations are done manually, collecting information on new movies is a laborious process.

Automated analysis of movies using computational techniques to analyze representation has recently gained some attention. In (NYFA, 2013; Polygraph, 2016), the authors examine differences in relative frequency of female characters and note considerable disparities in gender ratio in these movies. However, the analyses there are limited to comparing relative frequencies. In (Ramakrishna et al., 2015), the authors study differences in language used in movies across genders by a one-dimensional analysis.

Accordingly, there is a need for improved methods and systems for analyzing media content in an efficient scalable manner.

SUMMARY

In at least one aspect, the present invention solves one or more problems of the prior art by providing a method for analyzing media content. The method includes a step of providing a plurality of narrative files formatted in human readable format. Each narrative file includes a script and/or dialogues tagged with character names along with auxiliary information. Each script includes a plurality of portrayals performed by an associated actor or character. Linguistic representations of content of the narrative files in both abstract and semantic forms are determined. The linguistic representations are connected to higher order mental states and higher order constructs. The linguistic representations are connected to behavior and action. Interplay between language constructs and demographics of content creators is analyzed. Content representations towards individuals/groups are adapted to reflect heterogeneity in preferences. Typically, a computer system is operable to perform at least one of or all of the steps of the method.

In another aspect, a large scale automated analysis of movie characters using language used in dialogs to study stereotyping along factors such as gender, race and age is provided.

In another aspect, laborious annotation is avoided by estimating the metadata computationally, thereby enabling efficient scaling up.

In yet another aspect, fine grained comparisons of character portrayal are performed using multiple language-based metrics along factors such as gender, race and age on a newly created corpus.

In still another aspect, a movie screenplay corpus (sail.usc.edu/mica/text_corpus_release.php) is constructed that includes a plurality (e.g., nearly 1000 or more) of movie scripts obtained from the Internet or other source of such files. For each movie in the corpus, additional metadata such as cast, genre, writers and directors, as well as actor level demographic information such as gender, race and age, are obtained or assigned. Two kinds of measures are used in the analyses: (i) linguistic metrics that capture various psychological constructs and behaviors, estimated using dialogues from the screenplay; and (ii) graph theoretic metrics estimated from character network graphs, which are constructed to model inter-character interactions in the movie. The linguistic metrics include psycholinguistic normatives, which provide word level scores on a numeric scale which are then aggregated at the dialog level, and metrics from the Linguistic Inquiry and Word Counts tool (LIWC) which capture usage of well-studied stereotyping dimensions such as sexuality. Centrality metrics are estimated from the character network graphs to measure relative importance of the different characters, which are analyzed with respect to the different factors of gender, race and age.

Advantageously, embodiments and variations of the present invention provide: (i) a scalable analysis of differences in portrayal of various character subgroups in movies using their language use, (ii) a new corpus with detailed annotations for the analysis, and (iii) highlighting of several differences in the portrayal of characters along factors such as race, age and gender.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A: Linguistic representations used in some methods of the invention.

FIG. 1B: Linguistic representations used in some methods of the invention.

FIG. 2: Character networks as used in some methods of the invention.

FIG. 3: Synthesis as used in some methods of the invention.

FIG. 4: A schematic illustration of a computer system implementing the methods set forth above.

FIGS. 5A, 5B, 5C, 5D, 5E, 5F, 5G, 5H, 5I, 5J, 5K, 5L, 5M, 5N, 5O, and 5P: Histogram of age for actors belonging to different gender and racial categories with p-values on top; significant values at α=0.05 are highlighted; *: no test performed since the female group is empty.

DETAILED DESCRIPTION

Reference will now be made in detail to presently preferred compositions, embodiments and methods of the present invention, which constitute the best modes of practicing the invention presently known to the inventors. The Figures are not necessarily to scale. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for any aspect of the invention and/or as a representative basis for teaching one skilled in the art to variously employ the present invention.

Except in the examples, or where otherwise expressly indicated, all numerical quantities in this description are to be understood as modified by the word “about” in describing the broadest scope of the invention. Practice within the numerical limits stated is generally preferred.

It is also to be understood that this invention is not limited to the specific embodiments and methods described below, as specific components and/or conditions may, of course, vary. Furthermore, the terminology used herein is used only for the purpose of describing particular embodiments of the present invention and is not intended to be limiting in any way.

It must also be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” comprise plural referents unless the context clearly indicates otherwise. For example, reference to a component in the singular is intended to comprise a plurality of components.

Throughout this application, where publications are referenced, the disclosures of these publications in their entireties are hereby incorporated by reference into this application to more fully describe the state of the art to which this invention pertains.

Abbreviations:

“LIWC” means linguistic inquiry and word counts.

“TFIDF” means term frequency-inverse document frequency.

The term “dimension” refers to the aforementioned social/physical constructs. For example, in prior published work we used dimensions such as sexuality and power, among others.

In an embodiment, a method for analyzing media content is provided. Advantageously, the steps of the method are computer implemented by a computer having a computer processor operable to perform the steps of the method. The method includes a step of providing a plurality of narrative files formatted in human readable format. Examples of human readable formats include text files, word processor files (e.g., Word .doc files and .RTF files), PDF files, .txt files, and the like. The narrative files can be screenplay files or any story document with one or more characters, description of their actions, and a narrative that represents their interaction with one another. Each narrative file includes a script (e.g., a movie script) and dialogues tagged with character names along with auxiliary information. The terms “narrative file” and “script” as used herein will include movie scripts as well as any file that can be analyzed with the method set forth herein. In a variation, the auxiliary information includes, but is not limited to, shot location (interior/exterior), character placement and scene context. Characteristically, each script includes a plurality of portrayals performed by an associated actor. In a refinement, the narrative files are from a diverse set of writers and include a significant amount of noise and inconsistencies in their structure. Linguistic representations of content of the narrative files are determined. In a refinement, this representation is in both abstract and semantic forms (see, FIGS. 1A and 1B). In a refinement, the term “linguistic representations” refers to quantifying metrics which can be a number or sequence of numbers.

The following papers describe semantic vector spaces in document analysis: [1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). apers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf; [2] Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). www.aclweb.org/anthology/D14-1162; and [3] Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In Advances in neural information processing systems (pp. 3294-3302). https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf; the entire disclosures of which are hereby incorporated by reference.

Word level representations include psycholinguistic normatives such as valence or association with specific constructs such as anger, sexuality, etc. In a refinement, “construct” refers to a metric and/or measure for quantifying or describing a number of defining attributes. These attributes may have a degree of subjectivity in their definition and quantification. In a refinement, normatives are numeric representations of expected content from a given stimulus such as word/image. Normatives constructed from sentences on psychological constructs such as emotion are called psycholinguistic normatives. The linguistic representations are connected to higher order mental states and constructs (i.e., constructs including higher order dimensions). The linguistic representations are connected to behavior and action. Examples for the behavior and action include, but are not limited to, humor, violence, language sophistication, and combinations thereof. The interplay between language constructs and demographics of content creators is analyzed. For example, interplay will describe how the language trends between character demographics (e.g., gender, race, age) and their mental states, emotions, actions, and the like (e.g., females are happier).

Content representations are adapted towards individuals/groups to reflect heterogeneity in preferences. In a refinement, adapt means that the content representations are used to determine features of the representation of individuals and/or groups and whether there is diversity and/or heterogeneity in their representations. In a refinement, multiple levels connect words, emotion, and personality traits. In this context, “connect” means that construct representations can be constructed from lines in movies to various mental states and behaviors. For example, numeric measures of emotional dimensions such as valence (positive or negative connotation) and arousal (degree of activation) can be constructed. “Connects” can also be taken to mean “implies.”

In a variation, the method further includes a step of parsing the screenplay or other narrative files to extract predetermined relevant information to output utterances and character names associated with the output utterances. Similar movies or stories are identified as a plurality of potential matches. Alignments (e.g., movie alignments) are formed by computing name alignment scores for each match as a percentage of character names from the screenplay or other narrative files found in each of the similar movies or stories. In a refinement, character names are mapped by term frequency-inverse document frequency (TFIDF) to compute a name alignment score. Target entries are identified as movies or stories having an alignment score higher than a predetermined value. In a refinement, similar movies are identified that have a close match with a given screenplay or other narrative, manually if necessary. Demographics including age, gender, sex, education, profession, and race data are collected for each associated actor. Differences in portrayal of characters are determined along the linguistic representations. Portrayal differences can be measured by psycholinguistic normatives that capture an underlying emotional state of a speaker. In another refinement, biases in the portrayals are determined with respect to age, gender, and race. In a refinement, the alignments are manually corrected to fix incorrect gender maps, and a match is manually forced if necessary.

In a variation, the method further includes a step of fetching metadata for each parsed movie or story. A particularly useful source for this metadata is IMDb.com. Examples of metadata include, but are not limited to, year of release, directors, writers, producers, performers, and other creators of the content, and combinations thereof.

In another variation, a gender for actors and other members of a production team found in a movie or story is identified.

Psycholinguistic normatives can be used to provide a measure of emotional and psychological constructs of a speaker, the psycholinguistic normatives being computed entirely from language usage. Examples of emotional and psychological constructs include, but are not limited to, arousal, valence, concreteness, and intelligibility. In a refinement, a normative score for each of the psycholinguistic normatives is extrapolated from a small set of keywords which are annotated by psychologists, the normative score being computed on content words from each dialog. Normatives for an input word are numeric scores determined by linear regression as set forth below in more detail.
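The aggregation of word level scores over the content words of a dialog can be illustrated with a minimal sketch. The rating values, the stopword list, and the function name `dialog_score` are hypothetical and chosen only for illustration; averaging is one plausible aggregation and is an assumption, and the regression-based extrapolation of ratings is not reproduced here.

```python
import re

# Hypothetical word-level valence ratings on a 0-1 scale (illustrative values only).
VALENCE = {"happy": 0.9, "love": 0.95, "sad": 0.1, "angry": 0.15, "walk": 0.5}

# Small illustrative stopword list, used to keep only content words.
STOPWORDS = {"i", "am", "so", "the", "a", "to", "and", "you", "is"}


def dialog_score(utterance, ratings=VALENCE):
    """Average the ratings of the rated content words in one utterance.

    Returns None when no word in the utterance has a rating.
    """
    words = re.findall(r"[a-z']+", utterance.lower())
    scores = [ratings[w] for w in words if w not in STOPWORDS and w in ratings]
    return sum(scores) / len(scores) if scores else None
```

Returning None for utterances without any rated word keeps them out of later group averages instead of biasing those averages toward zero.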

In a variation, the portrayal differences are measured by the Linguistic Inquiry and Word Counts tool (LIWC), which provides a measure of a speaker's affinity to different predetermined social and physical constructs. The tool processes raw text and outputs the percentage of words from the script that belong to a predetermined dimension. The predetermined dimension includes a dimension selected from the group consisting of linguistic, affective, and perceptual constructs.
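The percentage-of-words output described above can be sketched as follows. The LIWC dictionaries are proprietary, so the category word lists here are hypothetical stand-ins, and `category_percentages` is an illustrative name rather than part of any LIWC interface.

```python
import re

# Hypothetical category word lists; the actual LIWC dictionaries are not reproduced.
CATEGORIES = {
    "affective": {"happy", "sad", "love", "hate"},
    "perceptual": {"see", "hear", "feel", "look"},
}


def category_percentages(text, categories=CATEGORIES):
    """Return, per category, the percentage of words in text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }
```

For the eight-word input "I see you and I love you now", each of the two categories matches one word, so each percentage is 12.5.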

With reference to FIG. 1A, a variation of the linguistic representations used in some methods of the invention is provided. As depicted in Box 10, one or more scripts are obtained and then subjected to analysis to determine character demographics (Box 12) and affective constructs (Box 14). The character demographics are determined by applying a gender predictor (Box 16) and from metadata associated with a script obtained from a movie database such as IMDb (Box 18). Examples of such metadata include year of release, directors, writers, producers, performers, and other creators of the content, and combinations thereof. Affective constructs (e.g., related to emotions) are formed by applying psycholinguistic normatives (Box 20) and LIWC metrics (Box 22). The character demographics and affective constructs are collectively combined to determine affect and character interaction metrics (Box 24). These affect/character interaction metrics can in turn be used for evaluating character portrayal (Box 26, e.g., the Bechdel Test for evaluating the portrayal of women in movies). The affect/character interaction metrics can also be used to determine or predict the impact of a movie (Box 28). The affect/character interaction metrics can also be used to provide personalized content (Box 30). For example, a recommender system can be used to suggest content to a user.

With reference to FIG. 1B, a variation of the linguistic representations used in some methods of the invention is provided. As depicted in Box 32, a script parser acts on a script (Box 34) and data from a movie database (Box 36). The script parser outputs character names and dialogues. The script parser acts on the extracted character names (Box 38). The script parser determines a character's age from a script database (Box 40). A name-based gender classifier (Box 42) can be used to determine gender from the character name. In a refinement, the name-based classifier uses metadata from a script database as set forth above. A race for a character can be assigned, for example, from knowledge of the actor's race or the intended race of the character (Box 44). The dialogues extracted by the script parser are then processed as indicated in Box 46 by a content analytic module which determines affective dimensions (Box 48) (e.g., parameters that describe emotion, optionally with creation of a scale for quantification) and character network metrics (Box 50) (e.g., centrality as set forth below in more detail). The outputs from the metadata extractor and the content analytic module are then operated on by a linguistic analyzer which connects the linguistic representations to higher order representations and mental states and to representations of behavior and action, analyzes interplay between language constructs and demographics of content creators, and adapts content representations towards individuals/groups to reflect heterogeneity in preferences.

In another variation, a network structure of interactions between characters is constructed using importance measures for each character (see, FIG. 2). A script (Box 58) is used to form a network structure (e.g., a graph). In Box 60, the network structure is formed by constructing an undirected and unweighted graph where nodes represent characters, placing an edge e_(ab) to represent interactions between two characters in terms of quality and quantity, and analyzing properties of a node and/or edge statically or over time. The taxonomy of media content is determined using character temporal and global network structures. In general, taxonomy refers to the classification (i.e., type of script or movie) of the media content, e.g., genre such as action movie, drama, SciFi, and the like. In particular, the network structures can be used to determine demographic interactions (conditioned) (Box 62), impact (Box 64), content personalization (Box 66), and/or narrative structure analytics (Box 68). In a variation, user personalizations such as online recommendation systems can be determined from the network structure. Personalizations can further be determined using networks conditioned on demographics.

In a refinement, an edge e_(ab) can be placed if two characters A and B interact at least once in the script, wherein characters A and B interact if there is at least one scene in which one speaks right after another. In this regard, betweenness centrality can be employed, where “betweenness” centrality is the number of shortest paths that go through a node. In a further refinement, degree centrality is employed, degree centrality being the number of edges incident on a node.
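The edge rule and the two centrality measures can be sketched with the standard library alone. The input format (each scene given as a list of speaker names in spoken order) and all function names are assumptions made for illustration. Betweenness is computed with Brandes' algorithm, which credits a node fractionally when several shortest paths connect the same pair of characters.

```python
from collections import defaultdict, deque


def build_graph(scenes):
    """Build an undirected, unweighted graph from scenes.

    scenes: list of scenes, each a list of speaker names in spoken order.
    An edge is placed when one character speaks right after another.
    """
    adj = defaultdict(set)
    for speakers in scenes:
        for s in speakers:
            adj[s]  # register every speaker, including isolated ones
        for a, b in zip(speakers, speakers[1:]):
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return dict(adj)


def degree_centrality(adj):
    """Number of edges incident on each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for an undirected, unweighted graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {n: b / 2 for n, b in bc.items()}  # each pair is visited twice
```

For two scenes with speaker orders B, A, C and C, D, the graph is the chain B, A, C, D, so the two middle characters have degree 2 and betweenness 2 while the end characters have degree 1 and betweenness 0.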

In a refinement, the quantity of access interactions between characters determines the weight of an edge as the number of dialogues, words exchanged, and other nonverbal cues exchanged. In another refinement, the quality of access interactions between characters determines the weight of an edge from linguistic representations. For example, if the edges are weighted using valence, higher edge weights indicate more positive character interactions. The quality of access interactions between characters can also be determined using the weight of an edge where the interactions are conditioned on predetermined features such as demographics, education level, and combinations thereof. In a refinement, the effects of addition, deletion, and/or substitution of nodes and/or edges in disrupting a movie or story plot can be evaluated. Measures of a node's importance as a proxy for a character's importance can also be estimated (see, FIG. 3). In this regard, adding/deleting a node amounts to adding or deleting a character in the plot and hence can be evaluated subjectively by observing the impact on the storyline. Node importance is estimated using predefined and established metrics from graph theory.

FIG. 3 illustrates the incorporation of the graph in performing the linguistic analysis. A plot (Box 70) is incorporated into a script (Box 72). This script is then subjected to two analyses: character demographic analysis (Box 74) and determination of the relative importance of the characters (Box 80). The character demographics can be determined from a gender predictor (Box 76) and from data obtained from a script database (Box 78) as set forth above. The metrics obtained from Box 76 and Box 80 can then be used to change character demographics (Box 82) and to update the script (Box 84) to meet a predetermined or desired policy.

In a variation, a taxonomy of media content is determined from the network structure conditioned on demographics.

In each of the methods set forth above, societal impact, commercial impact, policy impact, voting impact, buying impact, and combinations thereof can be determined and evaluated. For example, structure and demographics of narratives in movies or adverts may be related to box office or product sales. Structure of political narratives may be related to outcome of elections. In a particularly important application, conscious or subconscious biases that may have been introduced in the screenplay writing process are corrected by the methods set forth herein. For example, there may be deviations in representations of demographics such as gender and race in comparison with the general population which may be corrected before/during the casting process using the methods and computer system set forth herein. Another application is connecting representations to impact; for example, the average user rating or box office collections of a movie may be correlated to specific representations and portrayals in the movie. This may in turn be useful, for a content producer for example, to automatically filter movies that deviate substantially from known representation profiles.

With reference to FIG. 4, a schematic illustration of a computer system implementing the methods set forth above is provided. Computer system 90 includes computer processor 92 (e.g., microprocessor) that executes one, several, or all of the steps of the method. It should be appreciated that virtually any type of computer processor may be used, including microprocessors, multicore processors, and the like. The steps of the method typically are stored in computer memory 94 and accessed by computer processor 92 via connection system 96. In a variation, connection system 96 includes a data bus. In a refinement, computer memory 94 includes a computer-readable medium which can be any non-transitory (e.g., tangible) medium that participates in providing data that may be read by a computer. Specific examples for computer memory 94 include, but are not limited to, random access memory (RAM), read only memory (ROM), hard drives, optical drives, removable media (e.g., compact disks (CDs), DVD, flash drives, memory cards, etc.), and the like, and combinations thereof. In another refinement, computer processor 92 receives instructions from computer memory 94 and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein. Computer-executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies including, without limitation, and either alone or in combination, Java, C, C++, C#, Fortran, Pascal, Visual Basic, JavaScript, Perl, PL/SQL, etc. Display 98 is also in communication with computer processor 92 via connection system 96. Electronic device 90 also optionally includes various in/out ports 100 through which data from a pointing device may be accessed by computer processor 92. Examples for the electronic devices include, but are not limited to, desktop computers, smartphones, tablets, or tablet computers. In a refinement, neural networks can be used to perform the methods set forth above.

The following examples illustrate the various embodiments of the present invention. Those skilled in the art will recognize many variations that are within the spirit of the present invention and scope of the claims.

Data

Raw Screenplay

Movie screenplay files were fetched from two primary sources: imsdb (IMSDb, 2017) and daily scripts (DailyScript, 2017). In total, 1547 movies were retrieved. After removing duplicates, 1434 raw screenplay files were retained, of which 489 were corrupted or empty, leaving us with 945 usable screenplays. Tables 1, 2, and 3 list statistics about the corpus.

Script Parser

The screenplay files are formatted in human readable format and include dialogues tagged with character names along with auxiliary information of the scene such as shot location (interior/exterior), character placement and scene context. The screenplays are from a diverse set of writers and include a significant amount of noise and inconsistencies in their structure. To extract the relevant information, a text parser (bitbucket.org/anil_ramakrishna/scriptparser) was developed that accepts raw script files and outputs utterances along with character names. Scene context information is ignored, and the analysis focuses primarily on spoken dialogues to study language usage in the movies.
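The kind of extraction performed by such a parser can be illustrated with a deliberately simplified sketch. It is not the parser referenced above. It assumes a clean layout in which a character cue is an all-caps line, the dialogue follows on the next lines until a blank line, and scene headings begin with INT or EXT. Real screenplays are noisier than this, and the names `parse_script`, `SCENE_HEADING`, and `CHARACTER_CUE` are hypothetical.

```python
import re

# Assumed conventions: scene headings start with INT/EXT; cues are all-caps lines.
SCENE_HEADING = re.compile(r"^(INT|EXT)[./ ]")
CHARACTER_CUE = re.compile(r"^[A-Z][A-Z .'\-]*$")


def parse_script(text):
    """Return a list of (character name, utterance) pairs; scene context is dropped."""
    utterances, speaker, lines = [], None, []

    def flush():
        nonlocal speaker, lines
        if speaker and lines:
            utterances.append((speaker, " ".join(lines)))
        speaker, lines = None, []

    for raw in text.splitlines():
        line = raw.strip()
        if not line or SCENE_HEADING.match(line):
            flush()  # a blank line or a scene heading ends the current utterance
        elif speaker is None and CHARACTER_CUE.match(line):
            speaker = line
        elif speaker is not None:
            lines.append(line)
        # any other line is scene description and is ignored
    flush()
    return utterances
```

Action lines written entirely in capitals would be mistaken for character cues under these assumptions, which is one example of the structural noise that a production parser has to handle.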

Movie and Character Meta-Data

For each parsed movie, relevant meta-data such as year of release, directors, writers, and producers is fetched from the Internet Movie Database (IMDb, 2017).

Since most screenplays are drafts and subject to revisions such as changes in character names, matching them to an entry from IMDb is non-trivial. The process commences with a list of all movies that have a close match with the screenplay name; given this list of potential matches, name alignment scores are computed for each entry as the percentage of character names from the script found in that entry. The character names are mapped using term frequency-inverse document frequency (TFIDF) to compute the name alignment score following (Cohen et al., 2003, the entire disclosure of which is hereby incorporated by reference). Finally, the entry with the highest alignment score is chosen. For all actors listed in the aligned result, their age, gender and race are collected as detailed below.
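The selection step can be sketched as follows, assuming the candidate entries have already been retrieved. For brevity this sketch matches names exactly after lowercasing, which is a simplification; the TFIDF-based soft matching of (Cohen et al., 2003) is not reproduced. The function names and the candidate dictionary format are hypothetical.

```python
def alignment_score(script_names, cast_names):
    """Percentage of script character names found in a candidate's character list."""
    script = {n.strip().lower() for n in script_names}
    cast = {n.strip().lower() for n in cast_names}
    if not script:
        return 0.0
    return 100.0 * len(script & cast) / len(script)


def best_match(script_names, candidates):
    """Pick the candidate title with the highest alignment score.

    candidates: dict mapping a candidate title to its list of character names.
    """
    return max(candidates, key=lambda t: alignment_score(script_names, candidates[t]))
```

A minimum score could be applied before accepting the best candidate, so that a screenplay with no plausible entry is left for manual matching; the value of such a cutoff is not specified here.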

Gender

Given the names of actors and other members of the production team found in a movie, a name-based gender classifier is used to predict their gender information. Table 4 lists statistics on gender ratios for the production team in the corpus. Female-to-male ratios were found in close agreement with previous works (Smith et al., 2014).

As mentioned above, several screenplays get revised during production. In particular, character names get changed, sometimes even gender. As a result, some characters may not be aligned to the correct entry from IMDb. In addition, digitized screenplays sometimes include significant noise due to optical character recognition errors, leading to character names failing to align with entries from IMDb. To correct these, all movie alignments are manually cleaned up: incorrect gender maps are fixed, and movies mapped to the wrong IMDb entry are manually force-matched.

Age

The age for each actor is extracted to study possible age-related biases in movies. Age is included in the analysis since studies report preferential biases with age in employment, particularly when combined with gender (Lincoln and Allen, 2004). In addition, there may be biases in portrayal of specific age groups when combined with gender and race.

For each actor in the mapped IMDb entry, his/her birthday information is also collected. The actor's birth year is subtracted from the movie production year, also obtained from IMDb, to get an estimate of the actor's age during the movie's production. However, it is noted that the age obtained in this manner may be different from the portrayed age of the character. To account for this, the actors are binned into fifteen-year age groups before the analysis, since it is generally unlikely to have actors further than fifteen years from their portrayed age.
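The age estimate and binning can be sketched as below. The bin boundaries (starting at 0, so 0-14, 15-29, 30-44, and so on) and the use of years only, ignoring month and day, are assumptions made for illustration; the function names are hypothetical.

```python
def age_at_production(birth_year, production_year):
    """Estimate the actor's age when the movie was produced (years only)."""
    return production_year - birth_year


def age_bin(age, width=15):
    """Return the inclusive (low, high) bounds of the fifteen-year bin holding age."""
    low = (age // width) * width
    return (low, low + width - 1)
```

An actor born in 1960 appearing in a 1995 production has an estimated age of 35 and falls in the (30, 44) group.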

Race

Ethnicity information is parsed from the website (ethnicelebs.com, 2017), which includes ethnicity for approximately 8000 different actors. The information obtained from this site is primarily submitted by independent users and exhibits a significant amount of variation among the possible ethnicities, with about 750 different unique ethnicity types. Since racial representations are more interesting, the ethnicity types are mapped to race using Amazon Mechanical Turk (MTurk). A modified version of the racial categories from the US census is used; these categories are listed in Table 1 along with the frequency of actors from each racial category in the corpus.

The ethnicities obtained from the site above primarily cover major actors with a fan base, with no information for several actors who play minor roles. Racial information for nearly 2000 such actors is annotated using MTurk with two annotations for each actor, and the nearly 400 cases in which the annotators disagreed are manually corrected.

TABLE 1 Racial categories

Race               # Actors   Percentage
African                 585        7.44%
Caucasian              6539       83.24%
East Asian               73        0.93%
Latino/Hispanic         161        2.05%
Native American          15        0.19%
Pacific Islander          5       0.063%
South Asian              43       0.547%
Mixed                   434        5.52%

Experiments

Character Portrayal Using Language

To study differences in portrayal of characters, two different metrics were used: psycholinguistic normatives, which are designed to capture the underlying emotional state of the speaker; and LIWC metrics, which provide a measure of the speaker's affinity to different social and physical constructs such as religion and death. These two metrics are explained in detail below.

Psycholinguistic Normatives

Psycholinguistic normatives provide a measure of various emotional and psychological constructs of the speaker, such as arousal, valence, concreteness, intelligibility, etc., and are computed entirely from language usage. They are relatively easy to compute, provide reliable indicators of the above constructs, and have been used in a variety of tasks in natural language processing such as information retrieval (Tanaka et al., 2013), sentiment analysis (Nielsen, 2011), text-based personality prediction (Mairesse et al., 2007) and opinion mining.

The numeric ratings are typically extrapolated from a small set of keywords which are annotated by psychologists. Manual annotation of word ratings is a laborious process and is hence limited to a few thousand words (Clark and Paivio, 2004). Automatic extrapolation of these ratings to words not covered by the manual annotations can be done using structured databases which provide relationships between words such as synonymy and hyponymy (Liu et al., 2014), or using context-based semantic similarity.

In the analysis set forth below, the model described in (Malandrakis and Narayanan, 2015) is applied. In this model, the authors use linear regression to compute normative scores for an input word w based on its similarity to a set of concept words s_i.

$r(w) = \theta_{0} + \sum\limits_{i} \theta_{i} \cdot \mathrm{sim}(w, s_{i}) \quad (1)$

where r(w) is the computed normative score for word w, θ₀ and θ_i are regression coefficients, and sim is the similarity between the given word w and the concept words s_i.

The concept words can either be hand crafted suitably for the domain or chosen automatically from data. Similar to (Malandrakis and Narayanan, 2015), training data was created by posing queries to the Yahoo search engine using words from the aspell spell checker and collecting the top 500 previews for each query. From this corpus, the 10000 most frequent words with at least 3 characters were used as concept words in extrapolation of all the norms. The linear regression model is trained using normative ratings for the manually annotated words by computing their similarity to the concept words. The similarity function sim is the cosine of binary context vectors with window size 1. The computed normatives are in the range [−1,1].
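The regression above, together with the binary-context cosine similarity, can be sketched in a few lines. This is an illustration under stated assumptions rather than the model of (Malandrakis and Narayanan, 2015): the toy corpus, the concept words, and the coefficient values are all hypothetical, and the fitting of the coefficients on annotated words is omitted.

```python
import math
from collections import defaultdict


def context_sets(sentences, window=1):
    """Map each word to the set of words seen within `window` positions of it.

    A set is equivalent to a binary context vector.
    """
    ctx = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx[word].update(tokens[j] for j in range(lo, hi) if j != i)
    return ctx


def binary_cosine(a, b):
    """Cosine similarity of two binary vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def normative_score(word, concept_words, theta0, thetas, ctx):
    """r(w) = theta0 + sum_i theta_i * sim(w, s_i)."""
    w_ctx = ctx.get(word, set())
    return theta0 + sum(
        t * binary_cosine(w_ctx, ctx.get(s, set()))
        for t, s in zip(thetas, concept_words)
    )


# Toy corpus and hypothetical coefficients, for illustration only.
corpus = [["good", "movie"], ["great", "movie"], ["bad", "plot"]]
ctx = context_sets(corpus)
score = normative_score("good", ["great", "bad"], 0.0, [0.8, -0.7], ctx)
```

Here "good" shares its only context word with "great" and none with "bad", so its score is driven entirely by the coefficient attached to "great".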

The psycholinguistic normatives used in the experiments set forth herein are listed in Table 2. Valence is the degree of positive or negative emotion evoked by the word. Arousal is a measure of excitement in the speaker. Valence and arousal combined are common indicators used to map emotions. Age of Acquisition refers to the average age at which the word is learned, and it denotes sophistication of language use. Gender Ladenness is a measure of masculine or feminine association of a word. 10-fold cross validation tests are performed on the normative scores predicted by the regression model given by equation 1. Correlation coefficients of the selected normatives with the manual annotations are as follows: Arousal (0.7), Valence (0.88), Age of Acquisition (0.86) and Gender Ladenness (0.8). The high correlations lend confidence to the psycholinguistic models.

In the experiments set forth below, the normative scores are computed on content words from each dialog. All words other than nouns, verbs, adjectives and adverbs are filtered out. Word level scores are aggregated at the dialog level using the arithmetic mean.
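The content-word filtering and averaging step can be sketched as below. The part-of-speech tag names and the word scores are assumptions made for illustration; the text does not say which tagger or tag set is used.

```python
# Hypothetical tag names for nouns, verbs, adjectives and adverbs.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def dialog_score(tagged_tokens, word_scores):
    """Mean normative score over the content words of one dialog.

    `tagged_tokens` is a list of (word, tag) pairs; words without a score
    are skipped. Returns None when no content word has a score.
    """
    scores = [
        word_scores[word]
        for word, tag in tagged_tokens
        if tag in CONTENT_TAGS and word in word_scores
    ]
    return sum(scores) / len(scores) if scores else None


# Illustrative values only.
example = dialog_score(
    [("the", "DET"), ("happy", "ADJ"), ("dog", "NOUN")],
    {"the": 0.0, "happy": 0.8, "dog": 0.2},
)
```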

Linguistic Inquiry and Word Counts (LIWC)

LIWC is a text processing application that processes raw text and outputs the percentage of words from the text that belong to linguistic, affective, perceptual and other dimensions. It operates by maintaining a diverse set of dictionaries of words, each belonging to a unique dimension. Input texts are processed word by word; each word is searched in the internal dictionaries and the corresponding counter is incremented if a word is found in that dictionary. Finally, the percentages of words from the input text belonging to the different dimensions are returned.
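The dictionary lookup procedure described above can be illustrated with a short sketch. The dictionaries here are tiny hypothetical stand-ins, not the actual LIWC lexicon, and features of the real tool such as stem matching are not modeled.

```python
def liwc_percentages(tokens, dictionaries):
    """Percentage of tokens that fall in each dimension's dictionary."""
    counts = {dim: 0 for dim in dictionaries}
    for token in tokens:
        lowered = token.lower()
        for dim, words in dictionaries.items():
            if lowered in words:
                counts[dim] += 1
    total = len(tokens)
    return {dim: (100.0 * c / total if total else 0.0) for dim, c in counts.items()}


# Hypothetical mini-dictionaries for three of the dimensions in Table 2.
toy_dictionaries = {
    "religion": {"god", "church"},
    "death": {"grave", "die"},
    "swear": {"damn"},
}
result = liwc_percentages("God damn this grave".split(), toy_dictionaries)
```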

For the experiments, each utterance in the movie is treated as a unique document for which values for the LIWC metrics are obtained. Table 2 lists the metrics used in the experiments.

Character Network Analytics

In order to study representation of the different subgroups as major characters in movies, a network of interaction between characters is constructed and importance measures are computed for each character. From each movie script, an undirected and unweighted graph is constructed where nodes represent characters. An edge e_(ab) is placed if and only if two characters A and B interact at least once in the movie. For the experiments, interaction between A and B was assumed if there is at least one scene in which one speaks right after another. This graph creation method based on scene co-occurrence is similar to the approach used in (Beveridge and Shan, 2016). Different measures of a node's importance within the character network were estimated and used as proxy for the character's importance. Two types of centralities were employed: betweenness centrality, which is the number of shortest paths that go through the node, and degree centrality, which is the number of edges incident on a node. These centrality measurements have been previously used in the context of books, films and comics (Beveridge and Shan, 2016; Bonato et al., 2016; Alberich et al., 2002; Ribeiro et al., 2016).
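The graph construction and the two centralities can be sketched in plain Python. The scene format (a list of speaker names in speaking order) is an assumption made for illustration. Betweenness is computed with Brandes' algorithm, which uses the standard definition where a pair of nodes with several shortest paths contributes fractionally; this is a common formalization and not necessarily the exact computation used in the experiments.

```python
from collections import defaultdict, deque


def build_graph(scenes):
    """Undirected, unweighted graph: an edge joins two characters if one
    speaks right after the other in at least one scene."""
    adj = defaultdict(set)
    for speakers in scenes:
        for a, b in zip(speakers, speakers[1:]):
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def degree_centrality(adj):
    """Number of edges incident on each node (unnormalized)."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs."""
    bc = dict.fromkeys(adj, 0.0)
    for source in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from source
        dist = dict.fromkeys(adj, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in bc.items()}


# Hypothetical movie with two scenes; the resulting graph is B - A - C - D.
scenes = [["A", "B", "A", "C"], ["C", "D"]]
graph = build_graph(scenes)
degrees = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy movie, A and C each sit on two shortest paths between other characters, while B and D sit on none, which matches the intuition that removing A or C would disconnect the cast.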

Results

Differences in various subgroups along multiple facets were studied. Results on differences in character ratios from each subgroup are reported first since this has implications on employment and can have socioeconomic effects (Niven, 2006). Next, the psycholinguistic normatives and LIWC metrics described in the previous section are used to study differences in character portrayal along the primary markers: age, gender and race. Finally, the graph theoretic centrality measures are used to estimate characters' importance and analyze differences among the different subgroups.

Since there is an interest in character level analytics, all utterances from each character are treated as a single document to compute the aggregate language metrics. All of the experiments were performed using non-parametric statistical tests since the data fails to satisfy preconditions such as normality and homoscedasticity required for parametric tests such as ANOVA.

TABLE 2 Psycholinguistic Normatives and LIWC metrics used in analysis

Psycholinguistic norms   Valence, Arousal, Age of Acquisition, Gender Ladenness
LIWC metrics             Achievement, Religion, Death, Sexual, Swear

TABLE 3 Character statistics

                 male    female     total
# Characters     4899      2008      6907
# Dialogues    375711    154897    530608
                                      945

TABLE 4 Production team statistics

role                      male   female   total
Writers                   1326      169    1495
Directors                  544       46     590
Producers                 2866      870    3736
Casting Directors          135      275     410
Distributing Companies                     2701

TABLE 5 Contingency tables for character gender v/s writers, directors and casting directors' gender; f: female and m: male; each cell gives frequency of character gender for that column and production member gender for that row, numbers in parentheses indicate row wise proportion of character gender.

                               f (28.9%)       m (71.1%)
a. writers gender
   f                         249 (41.2%)     356 (58.8%)
   m                        1541 (27.6%)    4040 (72.4%)
b. directors gender
   f                         114 (39.3%)     176 (60.7%)
   m                        1676 (28.4%)    4220 (71.6%)
c. casting directors gender
   f                        1374 (29.1%)    3350 (70.9%)
   m                         416 (28.5%)    1046 (71.5%)

Difference in Relative Frequency of Subgroups

First, characters with unknown gender, race or age were filtered out, leaving 6907 characters in total. Table 3 lists the number of characters and dialogues from each gender. As noted in previous studies, the ratio is considerably skewed, with male actors having more than twice as many roles and dialogues compared to female actors. Table 4 lists relative frequency among male and female members of the production team. Table 1 lists the percentage of actors belonging to different racial categories in the corpus.

Chi-squared tests were performed between character gender and gender of production team members who are most likely to influence characters' gender: writers, directors and casting directors. Table 5 shows contingency tables with gender frequencies for each of these cases along with percentages. Note that nearly 100 movies in which the gender of the production team members was unknown were filtered out for this test. Of the three tests performed, character gender distributions for writer and director genders are significantly different from the overall character gender distribution (p<10⁻¹⁰ and p<10⁻⁴ respectively; α=0.05). In particular, female writers and directors appear to produce movies with relatively balanced gender proportions (still slightly skewed towards the male side) compared to male writers and directors. Casting directors, however, appear to have no influence on gender of the characters.
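As a rough check of these results, a chi-squared statistic can be computed directly from the counts in Table 5. The sketch below runs a test of independence on each 2x2 table without continuity correction; this is an assumption about the test setup, which the text does not fully specify, so the p-values are indicative only.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table.

    For one degree of freedom the survival function reduces to
    erfc(sqrt(stat / 2)), so no external library is needed.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))


# Counts from Table 5: rows are production member gender (f, m),
# columns are character gender (f, m).
writers = [[249, 356], [1541, 4040]]
directors = [[114, 176], [1676, 4220]]
casting = [[1374, 3350], [416, 1046]]

writers_stat, writers_p = chi2_2x2(writers)
directors_stat, directors_p = chi2_2x2(directors)
casting_stat, casting_p = chi2_2x2(casting)
```

Under these assumptions the writer and director tables give small p-values while the casting director table does not, in line with the pattern reported above.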

Studies report potential biases in actor employment with age (Lincoln and Allen, 2004), particularly in female actors. To evaluate this, histograms of age for male and female characters for each of the racial categories are plotted in FIG. 5. The distribution of age for each category appears approximately normal, except for the Native American and Pacific Islander character groups, which have a small sample size. For most categories of race, the mode of the distribution for female actors appears to be at least five years less than the mode for male actors. To check for significance in this difference, Mann-Whitney U tests were conducted on male and female age groups for each race with the resulting p-values shown in the figure. Characters belonging to the Pacific Islander racial group were ignored since there are no female actors from this race in the corpus. The difference in age groups is significant in most categories with large sample sizes, suggesting a possible preference towards younger people when casting female actors.
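For readers unfamiliar with the Mann-Whitney U test, a compact sketch follows. It uses average ranks for ties and a normal approximation for the two-sided p-value without tie or continuity correction; these simplifications are assumptions made for brevity, and the sample ages are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / std_u
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical ages, for illustration only.
female_ages = [24, 27, 31]
male_ages = [35, 41, 48]
u_stat, p_value = mann_whitney_u(female_ages, male_ages)
```

With samples this small an exact test would be preferable; the normal approximation is more appropriate for group sizes like those in the corpus.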

Character Portrayal Using Language

To analyze differences in portrayal of subgroups, psycholinguistic normatives and LIWC metrics are computed as described before. For each of the metrics listed in Table 2, non-parametric hypothesis tests are conducted to look for differences in samples from the subgroups. The different metrics are treated independently, performing statistical tests along each separately. Statistical tests combining two or more factors were avoided since some of the resulting groups would be empty due to the skewed group sizes along race.

TABLE 6 Median values for male and female characters along with p values obtained by comparing the two groups using Mann-Whitney U test; highlighted differences are significant at α = 0.05.

              m (4894)   f (2008)   p
age of acq.   −0.1590    −0.1715    <10⁻⁵
arousal        0.0253     0.0246    0.41
gender        −0.0312    −0.0055    <10⁻⁵
valence        0.2284     0.2421    <10⁻⁵
sex            0.00015    0.0000    0.08
achieve        0.0087     0.0080    <10⁻⁵
religion       0.0025     0.0022    0.10
death          0.0025     0.0016    <10⁻⁵
swear          0.0037     0.0015    <10⁻⁵

Gender

Mann-Whitney U tests were performed between male and female characters along the nine dimensions and the results are shown in Table 6. In all of the cases, higher values imply a higher degree of the corresponding dimension, except for valence, in which higher values imply positive valence (attractiveness) and lower values imply negative valence (averseness). The differences between male and female characters are statistically significant along six of the nine dimensions. The results indicate slightly higher age of acquisition scores for male characters. Regarding gender ladenness, male characters appear to be closer to the masculine side than female characters on average, agreeing with previous results.

The results also indicate that female character utterances tend to be more positive in valence compared to male characters, while male characters seem to have a higher percentage of words related to achievement. In addition, male characters appear to use words related to death as well as swear words more frequently than female characters.

Race

To study differences in portrayal of the racial categories, a Kruskal-Wallis test (a generalization of the Mann-Whitney U test for more than two groups) was performed on each of the nine metrics with race as the independent variable. Significant differences were found in distribution of samples for gender ladenness, sexuality, religion and swear words. For gender ladenness, Caucasian and mixed-race characters have significantly higher medians than African and Native American characters. In sexuality, Latino and mixed-race characters were found to have a significantly higher median than at least one other racial group, indicating a higher degree of sexualization in these characters. East Asian characters were found to have a significantly lower median than three other races (Caucasian, African and mixed) in using words with religious connotations. In swear word usage, the only significant difference found is between Caucasian and African characters, with African characters using a higher percentage of swear words. In all of the above cases, significance was tested at α=0.05.

TABLE 7 Coefficients of age for linear regression models along each dimension along with p-values; highlighted cells are significant at α = 0.05

              β₁(×10⁻³)   p-value
age of acq.    3.9        <10⁻¹⁰
arousal       −1.1        <10⁻¹⁰
gender        −2.5        <10⁻¹⁰
valence        0.078      0.7
sex           −0.25       <10⁻⁵
achieve        0.26       <10⁻¹⁰
religion       0.12       0.001
death         −0.039      0.2
swear         −0.34       <10⁻⁵

Age

To examine the relationship between age and the different metrics, separate linear regression models were built with each dimension as the dependent variable and character age as the independent variable. Table 7 reports regression coefficients for age along with p values for each dimension. The positive coefficient for age of acquisition indicates an increase in sophistication of word usage with age. Arousal, on the other hand, has a significant negative coefficient, indicating a decrease in activation, on average, as character age increases. Gender ladenness also has a significant negative coefficient, indicating that as age increases, the average gender ladenness value decreases. Similar trends are observed for sexuality and swear word usage. Usage of words related to achievement and religion, however, seems to increase with age.
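Each of these models is a simple one-variable least-squares fit, sketched below. The data points are synthetic and chosen only so that the recovered slope is easy to verify; significance testing of the coefficient is not shown.

```python
def fit_line(ages, values):
    """Ordinary least squares fit of values = b0 + b1 * age.

    Returns (b0, b1); b1 plays the role of the age coefficient in Table 7.
    """
    n = len(ages)
    mean_age = sum(ages) / n
    mean_val = sum(values) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (v - mean_val) for a, v in zip(ages, values))
    b1 = sxy / sxx
    return mean_val - b1 * mean_age, b1


# Synthetic example: a metric that rises by 0.004 per year of age.
ages = [20, 30, 40, 50]
metric = [0.1 + 0.004 * a for a in ages]
intercept, slope = fit_line(ages, metric)
```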

Character Network Analytics

To study differences in major roles assigned to the different subgroups, two centrality metrics were computed from the character network graph constructed for each movie. Degree centrality measures the number of unique characters that interact with a given character. Betweenness centrality measures how much the plot would be disrupted if said character were to disappear completely, i.e., how important a character is to the overall plot. Similar to the language analyses from the previous section, differences in these metrics were tested along the three factors of gender, race and age. All statistical tests reported below are conducted at α=0.05.

Gender

Male characters were found to have higher values in the two metrics compared to female characters, but the differences were not statistically significant. Motivated by studies (Sapolsky et al., 2003; Linz et al., 1984) which report interactions between genre and gender, Mann-Whitney U tests were performed between male and female characters given different genres. To avoid type I errors, multiple comparisons were corrected for using the Holm-Bonferroni correction. Significant differences were found only in horror movies, where the median degree centrality for females (0.221) was higher than the median degree centrality of males (0.166). This is in agreement with prior studies which report female characters to have a more prominent presence in horror movies, particularly as victims of violent scenes (Welsh and Brantford, 2009).
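The Holm-Bonferroni procedure mentioned above is a step-down correction: p-values are sorted in ascending order and the k-th smallest is compared against α/(m−k+1), stopping at the first failure. A minimal sketch follows; the p-values are made up for illustration and do not correspond to the per-genre results.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        # rank is 0-based, so the threshold is alpha / (m - rank).
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # all larger p-values are also retained
    return reject


# Hypothetical p-values for four genres.
decisions = holm_bonferroni([0.001, 0.04, 0.03, 0.20])
```

Only the first hypothesis survives here: 0.001 passes the 0.05/4 threshold, but 0.03 fails the next threshold of 0.05/3, which ends the procedure.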

Race

To examine differences in major roles across the racial categories, Kruskal-Wallis tests similar to the previous subsection were performed. Significant differences were found with both degree and betweenness centrality measures (p<0.001; α=0.05).

Latino characters were found to have significantly lower degree centralities compared to Caucasian and South Asian races, suggesting non-central roles for these characters. Caucasian characters were found to have median betweenness centralities significantly higher than at least one other race. Characters from the Native American race exhibit significantly lower medians in both degree and betweenness centralities than Caucasian, African and mixed characters, which agrees with (Rosenthal, 2012).

Age

The effects of age on importance of character roles were investigated by building a linear regression model on each of the two centralities with age as the independent variable. In both cases, age was found to be significant (p<0.001; α=0.05). With degree centrality, the regression coefficient β was found to be equal to 0.003. With betweenness centrality, the regression coefficient was also positive, given by β=8.41×10⁻⁴. Both these metrics indicate a positive correlation of character importance with age, i.e., as characters age, there is an increased interaction with other characters in the movie as well as higher prominence in the movie plot.

CONCLUSION

The embodiments and examples set forth above present a scalable automated analysis of differences in character portrayal along multiple factors such as gender, race and age using word usage, psycholinguistic and graph theoretic measures. Several interesting patterns are revealed in the analysis. In particular, movies with female writers and directors in the production team are observed to have more balanced gender ratios in characters compared to movies with male writers/directors. Across several races, female actors are found to be younger than male actors on average.

Female characters appear to be more positive in language use, with fewer references to death and fewer swear words compared to male characters. Female characters also appear to be more prominent in horror movies compared to male characters. Latino and mixed-race characters appear to have higher usage of sexual words. East Asian characters seem to use significantly fewer religious words. As character age increases, word sophistication seems to increase along with usage of words related to achievement and religion; there is also a significant reduction in word activation and in usage of sexual and swear words.

Future work includes expanding the analyses to non-English movies and combining the linguistic metrics with character networks. Specifically, character network edges can be weighted using the psycholinguistic metrics to analyze the emotional patterns in inter-character interactions.

While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention.

REFERENCES

-   Ricardo Alberich, Joe Miro-Julia, and Francesc Rossello. 2002. Marvel universe looks almost like a real social network. arXiv preprint cond-mat/0202174.
-   Elizabeth Behm-Morawitz and Dana E Mastro. 2008. Mean girls? the influence of gender portrayals in teen movies on emerging adults' gender-based attitudes and beliefs. Journalism & Mass Communication Quarterly 85(1):131-146.
-   Harry M Benshoff and Sean Griffin. 2011. America on film: Representing race, class, gender, and sexuality at the movies. John Wiley & Sons.
-   Andrew Beveridge and Jie Shan. 2016. Network of thrones. Math Horizons 23(4):18-22.
-   Anthony Bonato, David Ryan D'Angelo, Ethan R Elenberg, David F Gleich, and Yangyang Hou. 2016. Mining and modeling character networks. In Algorithms and Models for the Web Graph: 13th International Workshop, WAW 2016, Montreal, QC, Canada, Dec. 14-15, 2016, Proceedings 13. Springer, pages 100-114.
-   Gavin S Cape. 2003. Addiction, stigma and movies. Acta Psychiatrica Scandinavica 107(3):163-169.
-   James M Clark and Allan Paivio. 2004. Extensions of the paivio, yuille, and madigan (1968) norms. Behavior Research Methods, Instruments, & Computers 36(3):371-383.
-   William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. A comparison of string metrics for matching names and records. In Kdd workshop on data cleaning and object consolidation. volume 3, pages 73-78.
-   DailyScript. 2017. The daily script. [Online; accessed 1 Feb. 2017]. http://dailyscript.com/.
-   Paul G Davies, Steven J Spencer, and Claude M Steele. 2005. Clearing the air: identity safety moderates the effects of stereotype threat on women's leadership aspirations. Journal of personality and social psychology 88(2):276.
-   Tony Dimnik and Sandra Felton. 2006. Accountant stereotypes in movies distributed in north America in the twentieth century. Accounting, Organizations and Society 31(2):129-155.
-   Alice H Eagly and Steven J Karau. 2002. Role congruity theory of prejudice toward female leaders. Psychological review 109(3):573.
-   ethnicelebs.com. 2017. Celebrity ethnicity. [Online; accessed 1 Feb. 2017]. http://ethnicelebs.com.
-   Louis August Gottschalk and Goldine C Gleser. 1969. The measurement of psychological states through the content analysis of verbal behavior. Univ of California Press.
-   Mark Hedley. 1994. The presentation of gendered conflict in popular movies: Affective stereotypes, cultural sentiments, and men's motivation. Sex Roles 31(11-12):721-740.
-   Bell Hooks. 2009. Reel to real: race, class and sex at the movies. Routledge.
-   IMDb. 2017. Internet movie database. [Online; accessed 1 Feb. 2017]. http://www.imdb.com/.
-   IMSDb. 2017. Internet movie script database. [Online; accessed 1 Feb. 2017]. http://www.imsdb.com/.
-   Anne E Lincoln and Michael Patrick Allen. 2004. Double jeopardy in hollywood: Age and gender in the careers of film actors, 1926-1999. In Sociological Forum. Springer, volume 19, pages 611-631.
-   Daniel Linz, Edward Donnerstein, and Steven Penrod. 1984. The effects of multiple exposures to filmed violence against women. Journal of Communication 34(3):130-147.
-   Ting Liu, Kit Cho, George Aaron Broadwell, Samira Shaikh, Tomek Strzalkowski, John Lien, Sarah M Taylor, Laurie Feldman, Boris Yamrom, Nick Webb, et al. 2014. Automatic expansion of the mrc psycholinguistic database imageability ratings. In LREC. pages 2800-2805.
-   Francois Mairesse, Marilyn A Walker, Matthias R Mehl, and Roger K Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of artificial intelligence research 30:457-500.
-   Nikolaos Malandrakis and Shrikanth S Narayanan. 2015. Therapy language analysis using automatically generated psycholinguistic norms. In INTERSPEECH. pages 1952-1956.
-   Finn Årup Nielsen. 2011. A new anew: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
-   David Niven. 2006. Throwing your hat out of the ring: Negative recruitment and the gender imbalance in state legislative candidacy. Politics & Gender 2(04):473-489.
-   NYFA. 2013. Gender inequality in film. [Online; accessed 1 Feb. 2017]. https://www.nyfa.edu/film-school-blog/genderinequality-in-film/.
-   James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of liwc2015. Technical report.
-   James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. 2003. Psychological aspects of natural language use: Our words, our selves. Annual review of psychology 54(1):547-577.
-   Polygraph. 2016. Film dialogue from 2,000 screenplays, broken down by gender and age. [Online; accessed 1 Feb. 2017]. http://polygraph.cool/films/.
-   Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, and Shrikanth S Narayanan. 2015. A quantitative analysis of gender differences in movies using psycholinguistic normatives. In EMNLP. pages 1996-2001.
-   Mauricio Aparecido Ribeiro, Roberto Antonio Vosgerau, Maria Larissa Pereira Andruchiw, and Sandro Ely de Souza Pinto. 2016. The complex social network of the lord of rings. Revista Brasileira de Ensino de Fisica 38(1).
-   Nicolas G Rosenthal. 2012. Reimagining Indian country: Native American migration and identity in twentieth-century Los Angeles. Univ of North Carolina Press.
-   Barry S Sapolsky, Fred Molitor, and Sarah Luque. 2003. Sex and violence in slasher films: Reexamining the assumptions. Journalism & Mass Communication Quarterly 80(1):28-38.
-   Stacy L Smith, Marc Choueiti, and Katherine Pieper. 2014. Gender bias without borders: An investigation of female characters in popular films across 11 countries. USC Annenberg 5.
-   Shinya Tanaka, Adam Jatowt, Makoto P Kato, and Katsumi Tanaka. 2013. Estimating content concreteness for finding comprehensible documents. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pages 475-484.
-   Tom F M Ter Bogt, Rutger CME Engels, Sanne Bogers, and Monique Kloosterman. 2010. “shake it baby, shake it”: Media preferences, sexual attitudes and gender stereotypes among adolescents. Sex Roles 63(11-12):844-859.
-   Danny Wedding and Mary Ann Boyd. 1999. Movies & mental illness: Using films to understand psychopathology.
-   Andrew Welsh and Laurier Brantford. 2009. Sex and violence in the slasher horror film: A content analysis of gender differences in the depiction of violence. Journal of Criminal Justice and Popular Culture 16(1):1-25.
-   Bo Xiao, Zac E Imel, Panayiotis G Georgiou, David C Atkins, and Shrikanth S Narayanan. 2015. “rate my therapist”: Automated detection of empathy in drug and alcohol counseling via speech and language processing. PloS one 10(12):e0143055.

What is claimed is:
 1. A computer system including a computer processor, the computer processor operable to: receive at least one or a plurality of narrative files formatted in human readable format, each narrative file including a script and/or dialogues, the script and/or dialogues being tagged with character names along with auxiliary information, each script and/or dialogues including a plurality of portrayals performed by an associated actor; determine linguistic representations of content of the narrative files in both abstract and semantic forms; connect the linguistic representations to higher order representations and mental states; connect the linguistic representations to behavior and action; analyze interplay between language constructs and demographics of content creators; adapt content representations towards individuals/groups to reflect heterogeneity in preferences; parse screenplay files or other narrative files to extract predetermined relevant information to output utterances and character names associated with the output utterances; identify similar movies or stories as a plurality of potential matches; form movie or story alignments by computing name alignment scores for each match as a percentage of character names from the narrative files for each of the similar movies or stories; identify target entries as movies or stories having an alignment score higher than a predetermined value; collect demographics including age, gender, sex, education, profession, and race data for each associated actor; and determine differences in portrayal of characters.
 2. The computer system of claim 1 wherein each narrative file is a screenplay file or any story document with one or more characters, description of their actions, and a narrative that represents their interaction with one another.
 3. The computer system of claim 1 wherein multiple levels connect words, emotion, and personality traits.
 4. The computer system of claim 1 wherein the behavior and action include humor, violence, aggression, language sophistication, gender, ladenness of language use, and combinations thereof.
 5. The computer system of claim 1 wherein character names are mapped by term frequency-inverse document frequency to compute the name alignment scores.
 6. The computer system of claim 1 wherein similar movies are identified that have a close match with a predetermined screenplay name.
 7. The computer system of claim 1 wherein the computer processor is further operable to determine biases in the portrayals with respect to age, gender, and race.
 8. The computer system of claim 1 wherein the auxiliary information includes shot location (interior/exterior), character placement and scene context.
 9. The computer system of claim 1 wherein each screenplay file or other narrative file are from a diverse set of writers and include a significant amount of noise and inconsistencies in their structure.
 10. The computer system of claim 1 wherein the computer processor is further operable to fetch metadata for each parsed movie.
 11. The computer system of claim 10 wherein the metadata is selected from the group consisting of year of release, directors, writers, producers, performers, and other creators of content and combinations thereof.
 12. The computer system of claim 1 wherein the computer processor is further operable to identify a gender for actors and other members of a production team found in a movie.
 13. The computer system of claim 1 wherein portrayal differences are measured by psycholinguistic normatives that capture an underlying emotional state of a speaker, the psycholinguistic normatives providing a measure of emotional and psychological constructs of a speaker, the psycholinguistic normatives being computed entirely from language usage.
 14. The computer system of claim 13 wherein emotional and psychological constructs include arousal, valence, concreteness, and intelligibility.
 15. The computer system of claim 14 wherein a normative score for each of the psycholinguistic normatives is extrapolated from a small set of keywords which are annotated by psychologists, the normative score being computed on content words from each dialog.
 16. The computer system of claim 15 wherein the normative score for an input word is determined by linear regression from
$r(w) = \theta_{0} + \sum\limits_{i} \theta_{i} \cdot \mathrm{sim}(w, s_{i}) \quad (1)$
where: w is the input word; s_i is a concept word; r(w) is a computed normative score for word w; θ₀ and θ_i are regression coefficients; and sim is similarity between input word w and concept words s_i.
 17. The computer system of claim 1 wherein portrayal differences are measured by Linguistic Inquiry and Word Counts tool (LIWC) which provide a measure of a speaker's affinity to different predetermined social and physical constructs, processes raw text and outputs percentage of words from a script that belong to a predetermined dimension.
 18. The computer system of claim 17 wherein the predetermined dimension includes a dimension selected from the group consisting of linguistic, affective, and perceptual constructs.
 19. The computer system of claim 18 wherein degree centrality is employed, degree centrality being the number of edges incident on a node.
 20. The computer system of claim 1 wherein the computer processor is further operable to: construct a network structure of interactions between characters where importance measures for each character are computed by constructing an undirected and unweighted graph where nodes represent characters; placing an edge e_(ab) to represent interactions between two characters in terms of quality and quantity; and analyzing properties of a node and/or edge statically or over time; and determining taxonomy of media content using character temporal and global network structures.
 21. The computer system of claim 20 further comprising creating personalizations from the network structure.
 22. The computer system of claim 20 wherein quantity of interaction between characters is assessed using weight of an edge as number of dialogues, words exchanged, and other nonverbal cues exchanged.
 23. The computer system of claim 20 wherein quality of interaction between characters is assessed using weight of an edge from linguistic representations.
 24. The computer system of claim 20 wherein quality of interaction between characters is assessed using weight of an edge from interactions conditioned on predetermined features such as demographics, education level, and combinations thereof.
 25. The computer system of claim 20 further comprising determining the taxonomy of media content from the network structure conditioned on demographics.
 26. The computer system of claim 20 wherein the computer processor is further operable to create personalizations using networks conditioned on demographics.
 27. The computer system of claim 20 further comprising evaluating effects of addition, deletion, and/or substitution of nodes and/or edges in disrupting a movie plot.
 28. The computer system of claim 20 wherein the computer processor is further operable to estimate measures of a node's importance as a proxy for a character's importance.
 29. The computer system of claim 20 further comprising placing the edge e_(ab) if two characters A and B interact at least once in the script and/or dialogues, wherein characters A and B interact if there is at least one scene in which one speaks right after the other.
 30. The computer system of claim 29 wherein betweenness centrality is employed, betweenness centrality being the number of shortest paths that go through a node.
 31. The computer system of claim 1 wherein the computer processor is further operable to determine societal impact, commercial impact, policy impact, voting impact, buying impact, and combinations thereof.
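
By way of non-limiting illustration only, the normative score of Equation (1) and the character network measures recited in claims 19, 20, 29, and 30 may be sketched in Python as follows. The sketch makes several assumptions that are not part of the claims: words and concept words are represented as numeric vectors, sim is taken to be cosine similarity, a script is supplied as a list of scenes with each scene being a list of speaker names in speaking order, and betweenness centrality is computed in its common fractional form. All function names are hypothetical.

```python
import math
from collections import defaultdict, deque
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; an assumed choice for sim in Equation (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normative_score(word_vec, concept_vecs, theta0, thetas):
    """r(w) = theta0 + sum_i theta_i * sim(w, s_i), per Equation (1)."""
    return theta0 + sum(t * cosine(word_vec, s) for t, s in zip(thetas, concept_vecs))


def build_character_graph(scenes):
    """Undirected, unweighted graph: an edge joins A and B if, in at least
    one scene, one of them speaks right after the other."""
    graph = defaultdict(set)
    for speakers in scenes:
        for name in speakers:
            graph[name]  # make sure every speaker is a node
        for a, b in zip(speakers, speakers[1:]):
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Number of edges incident on each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def _bfs(graph, source):
    """Shortest-path distances and shortest-path counts from source."""
    dist = {source: 0}
    paths = defaultdict(int)
    paths[source] = 1
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    return dist, paths


def betweenness_centrality(graph):
    """For each node, the sum over node pairs (s, t) of the fraction of
    shortest s-t paths that pass through that node (unnormalized)."""
    info = {n: _bfs(graph, n) for n in graph}
    score = {n: 0.0 for n in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue  # s and t are not connected
        dist_t, paths_t = info[t]
        for v in graph:
            if v in (s, t) or v not in dist_s:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score


# Hypothetical two-scene script, given as speaker turns only.
scenes = [["ANNA", "BEN", "ANNA", "CARL"], ["CARL", "DANA"]]
graph = build_character_graph(scenes)
degrees = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy script the characters form a chain (BEN, ANNA, CARL, DANA), so the two interior characters receive the higher degree and betweenness values, consistent with using node importance as a proxy for character importance.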